Clustering and classification are visual ways of exploring statistical data. The aim of clustering is to find groups of data which are more similar (in some sense or another) to each other than to those in other groups.
This week I am using a built-in dataset, namely the Boston dataset which contains housing information in the Boston Mass area. The data has been collected by the U.S Census Service and it is also available online.
The Boston dataset consists of just 506 observations and there are 14 variables. Let us take a closer look at what those variables are.
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The variable names are not exactly self-explanatory. Because our analysis will again depend upon them, I will briefly explain what they are.
| Variable | Description |
|---|---|
| crim | per capita crime rate by town |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town |
| chas | Charles River dummy variable (1 if tract bounds river; 0 otherwise) |
| nox | nitric oxides concentration (parts per 10 million) |
| rm | average number of rooms per dwelling |
| age | proportion of owner-occupied units built prior to 1940 |
| dis | weighted distances to five Boston employment centres |
| rad | index of accessibility to radial highways |
| tax | full-value property-tax rate per $10,000 |
| ptratio | pupil-teacher ratio by town |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town |
| lstat | % lower status of the population |
| medv | median value of owner-occupied homes in $1000’s |
Thus, the Boston dataset consists of a variety of information: demographic, economic, and environmental factors as well as safety. I will not discuss the variables in more detail just now, but save it for later when we explore the standardized dataset - the comparison between the original and scaled variables will be more meaningful if the summaries are presented together.
So, let us next take a graphical tour of the data. In the figure below, each variable is plotted against the other variables.
Here we see some interesting distribution patterns, such as
* accumulation close to the edges and corners of the box - e.g. age/lstat
* more curved shapes close to the lower left corner - e.g. nox/dis
* compact round shapes close to one of the corners - e.g. rad/tax
* binary positionas in all pairs for variables chas and rad.
Another way of approaching the relationships between variables is to look at their correlations. Below is a correlogram where, instead of numerical values, the correlations are presented as coloured spots. Red colour indicates a negative correlation and blue colour indicates a posititve correlation. The stronger the correlation, the bigger the spot.
Three pairs stand out as having a strong negative correlation: indus/dis, nox/dis and lstat/medv. It is interesting to note the connection between this and the previous graph. If we go back to the previous graph, we can see that these pairs are those which showed curved patterns.
As for positive correlations, there are many to choose from but one stands out from the rest: rad/tax. This was also identified above and it is the one with a compact round shape.
The first step before attempting actual clustering is to standardize the dataset. This is because we need the variables to be comparable. Standardization is a simple operation in which the values of each variable are calculated using the following formula:
\(scaled(x) = [x-mean(x)]/sd(x)\)
i.e. the variable mean is substracted from the original value of the variable and this difference is the divided by the standard deviation of the variable.
Now let us take a closer look at these scaled variables and compare them to the original, non-scaled, dataset. Here are summaries of the original and scaled variables (in this order).
## crim zn indus chas
## Min. : 0.0 Min. : 0.0 Min. : 0.46 Min. :0.000
## 1st Qu.: 0.1 1st Qu.: 0.0 1st Qu.: 5.19 1st Qu.:0.000
## Median : 0.3 Median : 0.0 Median : 9.69 Median :0.000
## Mean : 3.6 Mean : 11.4 Mean :11.14 Mean :0.069
## 3rd Qu.: 3.7 3rd Qu.: 12.5 3rd Qu.:18.10 3rd Qu.:0.000
## Max. :89.0 Max. :100.0 Max. :27.74 Max. :1.000
## nox rm age dis
## Min. :0.385 Min. :3.56 Min. : 2.9 Min. : 1.13
## 1st Qu.:0.449 1st Qu.:5.89 1st Qu.: 45.0 1st Qu.: 2.10
## Median :0.538 Median :6.21 Median : 77.5 Median : 3.21
## Mean :0.555 Mean :6.28 Mean : 68.6 Mean : 3.80
## 3rd Qu.:0.624 3rd Qu.:6.62 3rd Qu.: 94.1 3rd Qu.: 5.19
## Max. :0.871 Max. :8.78 Max. :100.0 Max. :12.13
## rad tax ptratio black lstat
## Min. : 1.00 Min. :187 Min. :12.6 Min. : 0 Min. : 1.7
## 1st Qu.: 4.00 1st Qu.:279 1st Qu.:17.4 1st Qu.:375 1st Qu.: 7.0
## Median : 5.00 Median :330 Median :19.1 Median :391 Median :11.4
## Mean : 9.55 Mean :408 Mean :18.5 Mean :357 Mean :12.7
## 3rd Qu.:24.00 3rd Qu.:666 3rd Qu.:20.2 3rd Qu.:396 3rd Qu.:17.0
## Max. :24.00 Max. :711 Max. :22.0 Max. :397 Max. :38.0
## medv
## Min. : 5.0
## 1st Qu.:17.0
## Median :21.2
## Mean :22.5
## 3rd Qu.:25.0
## Max. :50.0
## crim zn indus chas
## Min. :-0.42 Min. :-0.49 Min. :-1.556 Min. :-0.27
## 1st Qu.:-0.41 1st Qu.:-0.49 1st Qu.:-0.867 1st Qu.:-0.27
## Median :-0.39 Median :-0.49 Median :-0.211 Median :-0.27
## Mean : 0.00 Mean : 0.00 Mean : 0.000 Mean : 0.00
## 3rd Qu.: 0.01 3rd Qu.: 0.05 3rd Qu.: 1.015 3rd Qu.:-0.27
## Max. : 9.92 Max. : 3.80 Max. : 2.420 Max. : 3.66
## nox rm age dis
## Min. :-1.464 Min. :-3.88 Min. :-2.333 Min. :-1.27
## 1st Qu.:-0.912 1st Qu.:-0.57 1st Qu.:-0.837 1st Qu.:-0.80
## Median :-0.144 Median :-0.11 Median : 0.317 Median :-0.28
## Mean : 0.000 Mean : 0.00 Mean : 0.000 Mean : 0.00
## 3rd Qu.: 0.598 3rd Qu.: 0.48 3rd Qu.: 0.906 3rd Qu.: 0.66
## Max. : 2.730 Max. : 3.55 Max. : 1.116 Max. : 3.96
## rad tax ptratio black
## Min. :-0.982 Min. :-1.313 Min. :-2.705 Min. :-3.90
## 1st Qu.:-0.637 1st Qu.:-0.767 1st Qu.:-0.488 1st Qu.: 0.20
## Median :-0.522 Median :-0.464 Median : 0.275 Median : 0.38
## Mean : 0.000 Mean : 0.000 Mean : 0.000 Mean : 0.00
## 3rd Qu.: 1.660 3rd Qu.: 1.529 3rd Qu.: 0.806 3rd Qu.: 0.43
## Max. : 1.660 Max. : 1.796 Max. : 1.637 Max. : 0.44
## lstat medv
## Min. :-1.53 Min. :-1.906
## 1st Qu.:-0.80 1st Qu.:-0.599
## Median :-0.18 Median :-0.145
## Mean : 0.00 Mean : 0.000
## 3rd Qu.: 0.60 3rd Qu.: 0.268
## Max. : 3.55 Max. : 2.987
## [1] "matrix"
The standardization procedure has affected the variables considerably. The range of values has been radically reduced, for example in age [2.9,100.0] -> [-2.333,1.116] or tax [187,711] -> [-1.313,1.796]. What is most noteworthy, though, is the change in the mean values - in the scaled dataset, the mean value is 0.000 for all variables. The variables are not aligned and thus more easily comparable.
I will use this scaled dataset from this point onwards. There are still some minor adjustments to be made. In the analysis to come, I will use crime rate as the target variable, and in order to do so the continuous variable crim needs to be converted into a new, categorical variable crime (after which the original variable will be removed, because all the other variables will serve as the explanatory variables). The quantiles of the scaled variables crim
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.42 -0.41 -0.39 0.00 0.01 9.92
will serve as the breakpoints of the new categorical variables crime:
## crime
## low med_low med_high high
## 127 126 126 127
Creating a model from your data is interesting by itself, but it falls short if there is nothing to test it on. Luckily, we can divide the dataset into training and test sets where the former is used for creating the model and the latter for testing the model’s prediction power. I will use 80% of my data for training and the remaining 20% for testing.
fit linear discriminant analysis to train set (categorical crime rate as target variable & all others as predictor variables) + draw LDA (bi)plot:
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.262 0.248 0.240 0.250
##
## Group means:
## zn indus chas nox rm age dis rad tax
## low 0.974 -0.905 -0.0866 -0.887 0.4588 -0.887 0.884 -0.690 -0.741
## med_low -0.107 -0.307 0.0426 -0.567 -0.0952 -0.342 0.309 -0.548 -0.475
## med_high -0.388 0.187 0.2553 0.393 0.0778 0.447 -0.365 -0.453 -0.336
## high -0.487 1.017 -0.1164 1.036 -0.4532 0.831 -0.868 1.638 1.514
## ptratio black lstat medv
## low -0.466 0.3830 -0.7648 0.536
## med_low -0.121 0.3187 -0.1936 0.047
## med_high -0.267 0.0804 0.0429 0.148
## high 0.781 -0.6543 0.8986 -0.691
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.0630 0.7025 -1.0304
## indus 0.0329 -0.1325 0.2351
## chas -0.1013 -0.0895 0.0750
## nox 0.3605 -0.8075 -1.4696
## rm -0.1436 -0.0672 -0.1826
## age 0.1682 -0.3541 -0.1768
## dis -0.0245 -0.2903 -0.0200
## rad 3.6157 1.0492 -0.0716
## tax 0.1351 -0.0891 0.6960
## ptratio 0.1054 -0.0371 -0.4492
## black -0.1022 0.0710 0.1395
## lstat 0.3075 -0.3059 0.1675
## medv 0.2764 -0.4317 -0.2875
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9566 0.0333 0.0101
save crime categories from the test set:
Prediction predit the classes with LDA model on test data + crosstabulate with crime categories from the test set (= correct_classes):
## predicted
## correct low med_low med_high high
## low 11 10 0 0
## med_low 5 13 8 0
## med_high 1 6 19 3
## high 0 0 0 26
COMMENT ON RESULTS
K-means clustering reload Boston + standardize
calculate distances between observations:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.13 3.46 4.82 4.91 6.19 14.40
run k-means clustering [Euclidean]:
investigate optimal number of clusters & run k-means again: INTERPRET RESULTS
We were also given the code for producing a 3D plot. I did not have time to study it in detail, but I absolutely had to see what what it looks like, so here it is!
## [1] 404 13
## [1] 13 3
check colours (should be the crime classes of the train set)